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1.0   ABSTRACT 

An  0^(n'~2)  lower  bound  on  the  area  required  for  VLSI 
implementations  of  permutation  networks  and  concentrators  is 
given.  A  tradeoff  result  for  the  bandwidth  of  packet 
switching  networks  and  the  area  required  to  implement  them 
is  p  rove  n  . 


Keywords:   Communication  networks,  VLSI  complexity 


We  use  the  following  typographical  conventions 
The  symbol  0^  is  used  for  capital  Omega; 
The  symbol  9  is  used  for  capital  Theta; 
The  sign  "  is  used  to  denote  exponentiation. 


2.0   INTRODUCTION 

In  a  recent  paper  [3]  Franklin  describes  a  VLSI 
implementation  of  a  banyan  network  with  n  inputs  and  n 
outputs  that  requires  0(n'"2)  area.  The  previous  best  known 
lower  bound  for  such  implementations  follows  from  a  result 
of  Kung  [6],  and  is  0^(n"2/(lg  n)"2).   We  close  this   gap   in 

he  present  paper  by  providing  a  lower  bound  that  matches 
the  0(n"2)  upper  bound.  The  lower  bound  is  valid  for  any 
hype r concent  rat  or ,  and  can  be  extended  to  apply  to  all 
circuit  switching  networks  described  in  [11] ,  and  in 
particular  to  0^-net  works.  The  proof  uses  the  standard 
bisection  argument  found  in  [13],  as  refined  by  [1].  The 
same   method   yields   an   0^(n"3/2)  lower  bound  on  the  volume 

equired  for  a  3D  implementation  of  these  networks. 

The  bisection  method  is  also  applied  to  the  analysis  of 
packet  switching  networks.  The  relevant  figure  of  merit  for 
these  networks  is  the  system  bandwidth  and  we  prove  that  the 
area  required  to  implement  such  networks  is  quadratic  in  its 
bandwidth . 


As  a  corollary,  we  obtain  a  new  proof  for  the 
0(n"2/(lg  n)"2)  lower  bound  on  the  area  of  a  shuffle 
exchange  network,  and  other  similar  recirculating   networks. 


The    implications    of    these   lower   bounds   to   parallel 
processing  are  discussed. 

3.0   CIRCUIT  SWITCHING  NETWORKS 


We  briefly  review  the  terminology  related  to 
communication  networks.  The  interested  reader  will  find  a 
detailed  introduction  in  the  two  references  [7]  and  [11]. 

The  topology  of  a  communication  network  is  described  by 
a  directed  graph.  The  nodes  of  the  graph  are  either  inputs , 
i.e.  have  in-degree  zero;  outputs ,  i.e.  have  out-degree 
zero;  or  swi  t  ch  es ,  i.e.  have  positive  in-degree  and 
out-degree.  The  degree  of  each  node  is  bounded  by  a  (small) 
constant  independent  of  the  network  size.  A  link  is  created 
between  an  input  and  an  output  by  dedicating  to  that  pair  a 
path  that  connects  them.  Different  pairs  that  are  connected 
concurrently  use  edge  disjoint  paths. 

Networks  are  classified  according  to  the  family  of 
connections  they  are  able  to  support  concurrently.  We  list 
the  following  types  of  networks  in  decreasing  order  of 
power  . 


An  n-permutation  ne  two  rk.  has  n  inputs  and  n  outputs  and 
contains  n  edge  disjoint  paths  connecting  inputs  II, ...In  to 
outputs  Os (  1  ),...,  Os  (n )  respectively,  for  any  permutation  s. 

An  n-super concent  rat  or  has  n  inputs  and  n  outputs  and 
contains  for  any  k  _<  n  and  any  subsets  of  k  inputs  and  k 
outputs  k  edge  disjoint  paths  connecting  the  k  inputs  to  the 
k  outputs,  in  some  order. 

An  n-hyper concent  rat  or  has  n  inputs  and  n  outputs  and 
contains  for  any  k  _<  n  and  any  subset  of  k  inputs  k  edge 
disjoint  paths  connecting  these  inputs  to  the  fixed  set  of 
outputs  01,  ...,0k,  in  some  order. 

An  (n,m,c)-concentrator  has  n   inputs   and  m   outputs, 

where   n  >^  m  ^  c ,   and   for   any  set  of  c  inputs  there  are  c 

edge  disjoint  paths  connecting  these  inputs   to  c   distinct 
out  put s . 

Note  that  any  n-permutation  network  is  an 
n-superconcent rat  or ,  any  n-super concent  rat  or  is  an 
n-hyper concent  rat  or  and  any  n-hyper concent  rat  or  can  be  used 
as  an  (n , m, m) -concent  rat  or  for  any  m  £  n,  by  ignoring  some 
outputs.  Note  also  that  a  simple  information  theoretic 
argument  shows  that  an  n-permutation  network  has  at  least 
0(n  Ig  n)  switches.   On  the  other   hand   it   is   known   that 


n-s uper cone ent rat ors ,  n-hy perconcent r a t ors  and 

(n , m , m) -concent  rat ors  can  be  built  with  a  number  of  switches 
and  lines  linear  in  the  number  of  inputs  and  outputs. 

We  use  the  VLSI  model  proposed  in  [1] •  A  VLSI 
implementation  of  a  communication  network  corresponds  to  an 
embedding  of  the  associated  graph  into  a  convex  planar 
region  R  that  fulfills  the  following  three  conditions: 

1.  Lines  have  minimal  width  1. 

2.  At  most  q  2.  -^  lines  overlap  at  any  point. 

3.  Nodes  have  minimal  area  1. 


We  proceed  now  to  the  bisection  argument.  A  partition 
of  the  nodes  of  a  communication  network  C  is  a  bisection  if 
it  partitions  the  inputs  of  C  into  two  subsets  II  and  12 
such  that  |I1|  <_  |I2|  <_  |I1|  +  1.  The  width  of  a  bisection 
is  defined  to  be  the  number  of  edges  connecting  nodes  that 
are  in  opposite  parts  of  the  partition.  The  minima  1 
bisect  ion  width  of  a  network  is  defined  to  be  the  minimal 
width  of  any  of  its  bisections.  The  following  result  can  be 
proven,  using  exactly  the  same  technique  as  in  [1] . 


LEMMA  1  (bisection  lemma).  let  w  be  the  minimal 
bisection  width  of  the  communication  network  C.  Then  any 
VLSI  implementation  of  C  has  area  0_('w^2)  . 

Sketch  of  the  proof:  Let  C  be  embedded  in  R.  A  chord 
can  be  found  that  is  perpendicular  to  a  diameter  of  R,  and 
that  bisects  the  inputs  of  C.  At  least  w  lines  cross  that 
chord,  so  that  its  length  is  at  least  w/q  (where  q  is  the 
number  of  "layers").  Since  R  is  convex  it  follows  that  the 
area  of  R  is  at  least  c(w/q)'"2,  where  c  =  l/(2pi)  (see  [1]). 

LEMMA  2.  The  minimal  bisection  width  of  an 
n-super concent  rat  or  is  at  least  n/2. 

Proof:   Let  a  bisection  partition   the   inputs   of  the 

network   into  two  sets  II  and  12  and  the  outputs  into  01  and 

02.   We  have  |I1|  >    n/2  and  |I2|  >_  n/2.   Also  l01|  >    n/2  or 

|02|  >    n/2.    Assume  w.l.o.g.   that  |01|  >    nil.       We  have  n/2 

edge  disjoint  paths  connecting   the   inputs   in   12   to  the 

outputs   in  01  and  therefore  at  least  n/2  edges  crossing  the 
bisection. 

COROLLARY  3-  A  VLSI  implementation  of  an 
n-super  concent  rat  or  has  area  0^(n'"2). 


Proof:  Follows  immediately  from  the  previous  two 
lemmas . 

Since  an  n-permu tat  ion  network  is  also  an 
n- super  concent ra t or ,  the  last  lower  bound  applies  to 
permutation  networks  as  well. 


LEMMA   4.    The    minimal    bisection    width    of  an 
n-byperconcentrator  is  at  least  n/4. 

Proof:   Let  a  bisection  partition   the   inputs   of  the 
network   into   II   and   12   and  the^  first  n/2  outputs  of  the 

network  into  01  and  02.   Either   |01|  >_   n/4   or   |02|  >^  n/4. 

Assume   w.l.o.g.    that   |01|  >^  n/4.    There   are   n/2  edge 

disjoint  paths  connecting  the  inputs  in  12  to  the  first  n/2 

outputs   and  therefore  at  least  n/4  edge  disjoint  paths  from 
inputs  in  12  to  outputs  in  01.   The  lemma  now  follows. 

COROLLARY    5.     A    VLSI    implementation     of  an 
n-hyper  concent  rat  or  has  area  0^(n"2). 

LEMMA   6.    The    minimal    bisection    width    of  an 
(n , m, c ) -concent  rat  or  is  at  least  min ( c , n/ 2 ) -m/2 . 


Proof:  Let  a  bisection  partition  the  inputs  and 
outputs  of  the  network  into  the  sets  II,  12  and  01,  02 
respectively.  Assume  w.l.o.g.  that  |01|  _>  m/2.  Since 
|I2|  >_  n/2  there  exist  at  least  min(c,n/2)  edge  disjoint 
paths  connecting  inputs  in  12  to  distinct  outputs.  At  most 
m/2  of  these  outputs  are  in  02,  so  that  at  least 
min (c , n/ 2 ) -m/2  edge  disjoint  paths  are  connecting  inputs 
from  12  to  outputs  from  01.   The  claim  now  follows. 

COROLLARY  7.  A  VLS I  -  imp lement a t ion  of  an 
(  2m,  m,  m) -concent  rat  or  has  areai  0^(m'"2  )  . 

The  nxn  crossbar  is  an  n-p ermu tat  ion  network  that  can 
be  implemented  in  0(n"2)  area  [3].  It  follows  that  all  the 
lower  bounds  given  above  are  matched  by  upper  bounds  of  the 
same  order  . 


It  is  interesting  to  note  that  the  lower  bounds  on  the 
area  required  for  super  concent  rat  or s ,  hype r concent  rat  or s  and 
concentrators  are  quadratic  in  the  number  of  inputs /output s , 
whereas  the  number  of  lines  and  switches  increases  only 
linearly.  Also,  whereas  the  complexity  of 
super concent  rat  or s  and  of  permutation  networks  are 
respectively  0(n)  and  &(n  Ig  n)  when  measured  by  the  number 
of  components,  this  difference  disappears  when  complexity  is 


measured  by  area.  This  is  not  so  surprising:  The  proof  of 
the  existence  of  linear  size  super  concent  rat  or s  uses  a 
noncons t rue t ive  probabilistic  argument.  The  best  explicit 
constructions  of  super concent  rat  or s  have  the  same  component 
complexity  as  permutation  networks  of  the  same  size.  It 
turns  out  that  the  wiring  in  linear  size  supe r concent  rat  or s 
is  sufficiently  "random"  to  preclude  a  VLSI  implementation 
which  is  substantially  better  than  a  VLSI  implementation  of 
a  permutation  network. 

THEOREM  8.  A  VLSI  implementation  of  any  of  the 
following  networks  requires  area  iO(n^2),  where  n  is  the 
number  of  inputs  and  outputs. 

1  .   0^-ne twork  . 

2.  Flip  (Staran)  network. 

3.  Indirect  binary  cube. 


4.   Baseline  network 


5.   SW-banyan  network 


6.   Augmented  data  manipulator 
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Proof:  The  definitions  of  these  networks  and  a 
description  of  their  properties  can  be  found  in  [11] ,  [15] 
and  the  references  there. 

Since  a  (Benes)  permutation  network  can  be  obtained  by 
joining  two  0^-networks  end  to  end,  it  follows  that  a  VLSI 
implementation  a  an  0^-network  requires  0^(n~2)  area. 

The  flip  ,  the  indirect  cube  and  the  baseline  networks 
have  the  same  topology  as  the  0^-network  and  therefore,  the 
same  bound  applies.  -  ^  . 

The  augmented  data  manipulator  graph  contains  the  graph 
of  the  0^-network  as  a  subgraph.  The  same  lower  bound 
applies  again. 

The  SW-banyan  network  with  2x2  switches  has  the  same 
topology  as  the  0-network.  The  proof  for  other  size 
switches  is  similar. 

The  bisection  method  also  applies  to  3-dimensional 
implementations  of  communication  networks.  We  assume  that 
the  network  is  embedded  in  a  convex  region  of  the  space, 
each  line  has  a  cross  section  area  of  at  least  1,  and  no  two 
wires  intersect.  Under  these  conditions  we  have  the 
following  analogue  of  lemma  1: 
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LEMMA  8.  Let  w  be  the  minimal  bisection  width  of  a 
communication  network  C.  Any  3D  implementation  of  C  has 
volume  0^(w"3/2). 

COROLLARY  9.  3D  implementations  of  an 
n-hy per  concent  rat  or ,  a  ( 2n , n , n) -concent  rat  or ,  and  of  each  of 
the  networks  mentioned  in  theorem  8  all  have  volume 
0(n^3/2)  . 

4.0   PACKET  SWITCHING  NETWORKS 

The  networks  discussed  in  the  previous  section  operate 
as  circuit  swi  t  ch  ing  ne  tworks :  A  logical  link  from  an  input 
to  an  output  is  provided  by  dedicating  a  path  that  connects 
the  pair.  An  alternative  mode  of  operation  is  provided  by 
packet  swi  t  ch  ing  networks .  A  logical  link  from  input  to 
output  is  established  through  the  transmission  of  a  message 
(packet)  through  the  network.  A  simplified  model  for  the 
operation  of  a  packet  switching  network  is  provided  by  the 
f o 1 lowing : 

1.  Each  node  has  an  infinite  buffer  to  store  messages 
in  transit.  Each  message  carries  its  destination 
(output)  address,  and  possibly  information  about 
its  history  through  the  network. 
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2.  Messages  are  generated  at  the  inputs,  retransmitted 
at  the  switches,  and  deleted  at  the  outputs. 

3.  At  the  beginning  of  each  cycle  each  node  can 
transmit  one  message  through  each  of  its  output 
lines.  The  message  is  received  at  the  other  end  of 
the  line  at  the  end  of  the  cycle. 

4.  The  decision  on  which  messages  to  forward  at  each 
node  is  either  global  for  a  centrally  controlled 
network  or  local  for  a  decentralized  network.  A 
global  routing  algorithm  takes  into  consideration 
the  current  state  and  the  history  of  the  entire 
network.  A  local  routing  algorithm  is  performed  by 
each  node  individually  i.e.  decisions  are  based 
only  on  the  information  (addresses  and  possibly 
histories)  of  the  messages  currently  and  previously 
stored  at  the  node. 


Clearly,  an  optimal  global  routing  algorithm  can 
guarantee  at  least  the  same  performance  as  an  optimal  local 
routing  algorithm.  On  the  other  hand  it  is  impractical  in 
many  applications  to  have  central  control  and  an  extra 
communication  network  for  control  information.  To  be  on  the 
safe  side,  we  shall  give  lower  bounds  for  the  performance  of 

13 


centrally  controlled  routing  network  and  match  them  whenever 
possible  by  similar  upper  bounds  on  decentralized  routing 
networks,  where  the  routing  algorithm  uses  only  the 
information  available  from  the  messages  currently  stored  at 
the  node  (but  not  information  from  previous  messages).  The 
results  will  be  in  terms  of  tradeoffs  between  area  and 
per  fo  rmance . 

One  possible  performance  measure  for  packet  switching 
networks  is  transit  t  ime ,  the  number  of  cycles  it  takes  for 
a  message  to  travel  from  an  input  to  its  destination  .  This 
measure  fails  to  differentiate  between  an  0-network  with 
log  n  stages  (figure  1)  and  a  one-  stage  recirculating 
shuffle-exchange  network  through  which  messages  are 
recirculated  Ig  n  times  (figure  2):  In  each  of  these 
networks  the  minimal  transit  time  is  Ig  n  and  time-area 
tradeoffs  arguments  yield  for  both  a  lower  bound  on  area  of 
0(n"2/(lg  n)2)   [6, 13] . 
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Figure  1 
^-network  with  8  inputs  and  8  outputs 
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Figure  2 
Shuffle-exchange  network  with  8  inputs  and  8  outputs 
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We   suggest   instead   to   measure   performance  by   the 
network  bandwld  th . 

The  network   bandwidth   roughly   measures   the  average 

number  of  messages  the  network  may  accept  per  cycle,  without 

being  overloaded.   This  corresponds,  in   a   queueing  system 

with   one   server,   to  the  reciprocal  of  the  average  service 
ra  t  e  • 


As  the  service  depends  on  the  distribution  of  the 
origins  and  destinations  of  the  messages,  that  distribution 
must  be  incorporated  into  the  definition  of  the  network 
bandwidth.  We  shall  follow  the  tradition  of  analyzing  the 
network  for  a  uniform  distribution  of  messages  [2,5]  . 

Let  C  be  a  packet  switching  network  with  n  inputs  and  n 
outputs  in  which  messages  are  independently  and  equiprobably 
generated  at  each  input,  and,  at  each  input,  they  are 
independently  and  equiprobably  destinated  to  each  output. 

We  denote  by  r  the  average  arrival  rate  in  the  system, 
that  is  the  expected  number  of  messages  generated  per  cycle. 
The  network  supports  an  average  arrival  rate  of  r  if  there 
exist  a  routing  algorithm  such  that,  given  an  average 
arrival  rate  of  r,  the  expected  transit  time  of  each 
message,   and  the  expected  number  of  messages  stored  at  each 
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node  are  uniformly  bounded.   The  network  has  bandwidth  b   if 
it  supports  all  average  arrival  rates  r  <  b. 

THEOREM  10.  The  minimal  bisection  width  of  a  network 
with  bandwidth  b  is  at  least  b/4. 

Proof:  Let  the  inputs  and  the  outputs  of  a  network  C 
be  partitioned  by  a  bisection  into  II,  12  and  01,  02 
respectively.  Let  r  be  an  average  arrival  rate  supported  by 
the  network.  Assume  w.l.o.g.  that  |02|  >^  |01|.  Then  the 
average  number  of  messages  generated  at  each  cycle  at  inputs 
in  II  with  destinations  in  02  exceeds  r/4.  Thus  at  least 
r/4  messages  must  be  able  to  cross  the  bisection  each  cycle. 
The  theorem  follows. 


COROLLARY  11.  A  VLSI  implementation  of  a  packet 
switching  network  with  n  inputs,  n  outputs,  and  bandwidth  b 
requires  area  0^(n+b"2). 

Proof:  The  b'"2  term  follows  from  the  bisection  lemma 
and  the  previous  theorem.  The  n  term  represents  the  area 
needed  for  the  input  and  output  nodes. 

Since  a  recirculating  shuffle-exchange  network  with  n 
nodes  provides  for  a  bandwidth  of  n/lgn  [5,8] ,  the  above 
result  implies  the  0(n"2/(lg  n)"2)  area  lower  bound  given  in 


[13]  . 

The  area  bandwidth  tradeoffs  of  cor.  11  can  be 
realized:  An  nxn  crossbar  has  area  0(n"2)  [3]  and  bandwidth 
n.  A  network  with  n  inputs,  n  outputs,  bandwidth  kn  and 
area  0((kn)'"2)  can  be  realized  by  connecting  n  inputs  and  n 
outputs  to  k  copies  of  an  nxn  crossbar,  as  illustrated  in 
figure  3  (note  the  counterintuitive  result  that  connecting 
each  node  to  k  copies  of  an  identical  network  requires  k"2 
times  as  much  area  as  when  only  one  network  is  used).  A 
network  with  n  inputs,  n  outputs,  bandwidth  n/k  and  area 
0(n+(n/k)^2)  can  be  realized,  by  building  an  (n/k)x(n/k) 
crossbar,  and  permitting  k  inputs  (outputs)  to  time 
multiplex  each  terminal  of  the  crossbar  (figure  4). 
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4  inputs,  4  outputs  network  with  bandwidth  12 
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Figure  4 
12  inputs,  12  outputs  network  with  bandwidth  4 
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We  have  therefore 

THEOREM  12.  The  minimal  area  of  a  packet  switching 
network  with  n  inputs,  n  outputs  and  bandwidth  b  is 
©(n  +  b''2)  . 

The  crossbar  network  is  by  no  means  the  unique  network 
that  achieves  this  bound.  The  0^-network  with  n  inputs  and  n 
outputs  also  has  a  ba-idwidth  of  n  [5]  and  can  be  implemented 
in  0(n"2)  area  [3].  Networks  with  larger  capacities  can  be 
built  by  replicating  the  0^-network,  and  networks  with  lesser 
capacities  can  be  built  by  time  multiplexing  a  smaller 
0-network.  Alternatively,  one  can  built  networks  with 
smaller  capacities  by  folding  some  of  the  stages  of  the 
0_-network,  and  recirculating  messages  [5].  As  a  special 
case,  we  obtain  the  shuffle  exchange  network. 

The  assumption  that  messages  are  uniformly  distributed 
is  too  restrictive  in  many  cases.  Consider  a  network  where 
each  input  and  each  output  is  connected  to  exactly  one  line. 
Such  a  network  can  support  a  flow  of  messages  only  if 
neither  the  expected  number  of  messages  created  per  cycle  at 
each  input  nor  the  expected  number  of  messages  destinated 
per  cycle  to  each  output,  exceeds  one.  This  restriction  on 
the   distribution   of   messages   is  the  weakest  one  it  makes 
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sense   to   consider.    It   is   captured   by   the    following 
de  f ini  t  ion • 

A  network  has  strong-bandwidth  b  if  it  supports  any 
flow  of  messages  with  a  distribution  that  fulfils  the 
following  two  conditions: 

1.  The  average  arrival  rate  is  less  than  b; 

2.  The  number  of  messages  generated  at  each  cycle  at 
each  input,  and  the  number  of  messages  generated  at 
each  cycle  for  each  output  are  identically, 
independently  distributed. 

We  do  not  require  however  that  messages  generated  at 
one  input  are  equally  destinated  to  each  output.  Note  that 
if  a  network  has  strong-bandwidth  b,  it  also  has  bandwidth  b 
according  to  the  previous  definition.  The  converse  is  not 
true:  an  0^-network  with  n  inputs  and  n  outputs  has 
bandwidth  n  but  strong-bandwidth  £sqrt(n).  Nonetheless,  the 
same  bandwidth  area  tradeoffs  are  obtained  for  this  stronger 
definition  of  bandwidth.  Indeed,  it  is  not  hard  to  see  that 
an  nxn  crossbar  has  strong-bandwidth  n.  Using  the 
techniques  of  theorem  12  we  obtain 
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THEOREM  13.  The  minimal  area  of  a  network  with  n 
inputs,  n  outputs  and  strong-bandwidth  b  is  QCn+b^Z). 

The  same  area  strong-bandwidth  tradeoff  is  realized  by 
the  Benes  permutation  network. 

5.0   UNIVERSAL  ULTRACOMPUTERS 

We  have  considered  bipartite  networks,  with  inputs 
distinct  from  outputs.  The  same  analysis,  and  the  same 
results  can  be  obtained  for  communication  networks  which  are 
not  bipartite:  Each  node  is  both  an  input,  an  output  and  a 
switch,  so  that  messages  are  generated,  retransmitted  or 
deleted  at  each  node.  Such  s  ingle  stage ,  or  recirculating 
networks  [11]  are  a  natural  framework  for  the  study  of 
ultracomputers  [10],  that  is  distributed  systems  of 
processors  working  independently,  but  coordinating  by  the 
exchange  of  messages  through  a  communication  network.  At 
each  node  of  the  network  there  is  a  processor.  Each 
processor  is  connected  to  a  constant  number  of  neighbour 
processors  and  may  initiate  messages,  receive  messages,  and 
retransmit  messages  in  transit. 
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We  say  that  a  recirculating  network  with  n  nodes  has 
weak-bandwidth  b  if  it  supports  any  distribution  of  messages 
such  that 

(1.1)  Messages  are  independently  and  equiprobably 
generated  at  each  input; 

(1.2)  At  each  input  messages  are  independently  and 
equiprobably  destinated  to  each  output; 

(1.3)  The  expected  number  of  messages  generated  at  each 
cycle  is  less  than  b. 

The  network  has  strong-bandwidth  b  if  it  supports  any 
distribution  of  messages  such  that 

(2.1)  Messages  are  independently  and  equiprobably 
generated  at  each  input; 

(2.2)  Messages  are  independently  and  equiprobably 
generated  for  each  output; 

(2.3)  The  expected  number  of  messages  generated  at  each 
cycle  is  less  than  b. 
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THEOREM  14.   (i)  The  minimal  area   of   a   recirculating 
network  with  n  nodes  and  weak-bandwidth  b  is  9(b'"2+n). 
(ii)  The  minimal  area  of   a   recirculating   network   with   n 
nodes  and  strong-bandwidth  b  is  Q(h~2+n)' 

Proof:  (i)  An  0(b"2+n)  lower  bound  can  be  proven  using 
the  techniques  of  th.  12.  As  for  a  matching  upper  bound, 
note  that  a  shuffle-exchange  network  [11]  with  n  nodes  has 
weak-bandwidth  n/lg  n  [5] ,  and  can  be  implemented  in  area 
0(n2/(lg  n)'~2)  [4].  A  network  with  weak-bandwidth  kn/lg  n 
can  be  obtained  by  duplicating  k  time  the  connections  in  the 
shuffle-exchange  network.  The  area  required  will  increase 
by  a  factor  of  k"2.  A  network  with  kn  nodes  and  bandwidth 
n/lg  n  can  be  obtained  by  having  k  nodes  share  each  terminal 
of  a  shuffle-exchange  network  with  n  nodes.  The  area 
required  to  implement  it  is  0(kn+(n/lg  n)"2).  (The  layout 
of  [4]  for  a  shuffle  exchange  graph  with  n  nodes  consists  of 
a  Ig  n  by  n/lg  n  grid  of  nodes,  embedded  into  a  square  of 
side  9(n/lg  n).  Each  node  can  be  replaced  by  a  vertical 
string  of  k  nodes,  and  the  new  structure  be  embedded  into  a 
rectangle  of  size  9(n/lg  n)  by  9(klg  n+n/lg  n).) 
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(ii)  The  same  lower  bound  is  obviously  valid  for 
strong-bandwidth.  A  matching  upper  bound  is  obtained  by 
applying  a  method  due  to  Valiant  [14] :  Let  C  be  a 
recirculating  network  with  n  nodes  and  weak-bandwidth  b.  We 
claim  that  the  network  has  strong-bandwidth  b/2  at  least. 

Let  messages  be  generated  at  the  nodes  of  C  with  a 
distribution  that  fulfils  conditions  2.1  -  2.3,  with  an 
average  arrival  rate  of  b/2.  Use  the  following  routing 
algorithm:  Each  message  is  routed  in  two  stages;  In  the 
first  stage  a  random  destination  is  picked  and  the  message 
is  routed  to  that  destination;  In  the  second  stage  the 
message  is  routed  to  its  correct  destination.  The  traffic 
in  the  network  is  now  identical  to  that  obtained  under  a 
distribution  of  messages  that  fulfils  conditions  1.1  -  1-3, 
with  an  average  arrival  rate  of  b.  The  network  can 
therefore  support  this  traffic. 

A  theoretical  model  of  parallel  machine  which  is  often 
used  is  that  of  a  paracomputer  [10] :  A  large  number  of 
processors  are  executing  independent  instruction  streams; 
All  the  processors  share  a  common  memory  which  they  can 
access  simultaneously  in  a  single  cycle,  provided  that  no 
two  of  them  access  the  same  location.  This  model  ignores 
fan-in  and  fan-out  limitations,  and  cannot  be  implemented  in 
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reality.  An  ul t r acomput er  can  however  simulate  a 
par acomput er  with  the  same  number  n  of  processors:  The 
shared  memory  is  distributed  among  the  processors,  and  an 
access  to  memory  is  simulated  by  sending  a  message  to  the 
node  containing  that  part  of  the  memory,  and  receiving  back 
a  message  with  the  information. 

Since  each  processor  is  connected  to  a  constant  number 
of  neighbours  the  average  length  of  a  minimal  path  between 
two  processors  is  0^(lg  n). 

Thus,  if  the  paracomputer  generates  requests  in  a 
random  pattern,  the  simulation  of  one  paracomputer  step 
takes  on  the  average  0^(lg  n)  ul  t  r  acorap  ut  er  steps.  This 
bound  can  be  achieved  if  the  ul t r acomput er  connection 
network  fulfils  the  following  two  conditions: 

1.  The  network  has  strong-bandwidth  6(n/lg  n). 

2.  In  a  distribution  of  messages  that  is  supported  by 
the  network  the  expected  transit  time  of  each 
message  is  0(lg  n). 
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An  ul t racomput er  fulfilling  these  two  conditions  is 
said  to  be  an  op  t  ima 1  universal  comput  er . 

THEOREM  15.  The  minimal  area  for  an  implementation  of 
an  optimal  universal  computer  with  n  nodes  is 
»(n~2/(lg  n)^2) . 

Proof:  The  lower  bound  follows  readily  from  the 
previous  theorem  and  the  definitions.  Examples  of 
structures  that  achieve  this  bound  are  provided  by  the  Cube 
Connected  Cycles  machine  of  [9]  and  the  perfect  shuffle 
machine  [4,10,12].  -■^"f-  ^'^'■'■^~■■■ 

?  c   n  o  j  ; 
Note  that  an  optimal  universal  computer  is   optimal   in 

the   minmax   sense:    The   worst  case  average  performance  of 

such  machine  is  minimized.   For  most  practical   applications 

ult racomput e rs  can  be  designed  that  outperform  the  universal 

computer  on  this  application. 


6.0   CONCLUSION 

We  have  exhibited  lower  bounds  for  the  area  required  to 
implement  circuit  switching  networks,  and  bandwidth-area 
tradeoffs  for  packet  switching  networks.  One  surprising 
result  is  that  structures  having  very  different  complexities 
when  measured  by  number  of  components  turn  out  to   have   the 
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same  area  complexity  (e.g.  superconcent r a t or  as  compared  to 
permutation  network  and  0^-networks  as  compared  to 
crossbars) . 

Our  analysis  for  packet  switching  networks  has  not 
taken  into  account  the  distribution  of  delays  through  the 
network.  While  this  makes  it  easier  to  obtain  clean 
tradeoff  results,  practical  implementations  have  to  consider 
transit  time  as  well  as  bandwidth.  This  second  parameter 
may  be  used  to  rank  structures  which  have  the  same  bandwidth 
[  5  ]  .  '  ■ 

Finally,  we  believe  that  the  )0((n/lg  n)""2)  lower  bound 
on  the  area  of  an  optimal  universal  computer  is  valid  for 
any  machine  that  can  sort  n  items  in  0(lg  n)  steps  on 
ave  rage  . 
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